ccl/sqlproxyccl: complete connection migration support in the forwarder #76805
Conversation
The structure for the connection migration work is there. Note that this is untested, and was mostly inspired by the initial prototype. Putting this up in case anyone has thoughts on the direction. I'll be adding tests next.
When it comes to the forwarder's API, I plan to have a Forwarder interface that implements just two methods: RequestTransfer and Close. With the current design, callers wouldn't know whether the transfer has failed or succeeded after calling RequestTransfer. I think that is fine for now, and if we really need to know the last transfer error, we could address that in the upcoming RFC. That said, the initial thought is to have the forwarder notify a "forwarder tracking" component (i.e. a map of pods to forwarders) whenever it connects to or disconnects from serverConn. The proxy can then check which pod the forwarder is assigned to (versus the previous pod) to know whether the transfer has succeeded.
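A minimal sketch of that surface (method set taken from the description above; everything else here is assumed, not final code):

// Forwarder is the planned API surface. Callers can only request a
// transfer or close the forwarder; whether a transfer succeeded is
// observed indirectly, e.g. via the pod-to-forwarder tracking component.
type Forwarder interface {
    // RequestTransfer requests that the forwarder migrate its server
    // connection to another SQL pod. It intentionally reports nothing
    // about the outcome.
    RequestTransfer()
    // Close terminates the forwarder and its connections.
    Close()
}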
pkg/ccl/sqlproxyccl/forwarder.go, line 52 at r1 (raw file):
These error messages might be unnecessary, at least for the ones that are only used once. We could just do return errors.New(...) there if we wanted to.
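For instance (a hypothetical example; errTransferFailed is an illustrative name, not an actual variable in this PR):

// Before: a package-level error only referenced at a single call site.
var errTransferFailed = errors.New("transfer failed")
...
return errTransferFailed

// After: construct the error inline at that one call site.
return errors.New("transfer failed")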
TFTRs!
This draft PR is ready for another look. I added more tests, and changed how we handle timeout handlers. I also tested this, and it works as expected.
This test is still WIP, and is not included in this PR; I will update that later today. Here, we call RequestTransfer in a loop and constantly query the SQL pod details. We can see that we alternate between pods, which implies that connection migration works:
=== RUN TestConnectionMigration/successful_connection
{postgresql://[email protected]:39917/defaultdb?sslcert=test_certs%2Fclient.root.crt&sslkey=test_certs%2Fclient.root.key&sslmode=verify-full&sslrootcert=test_certs%2Fca.crt true}
{postgresql://[email protected]:36529/defaultdb?sslcert=test_certs%2Fclient.root.crt&sslkey=test_certs%2Fclient.root.key&sslmode=verify-full&sslrootcert=test_certs%2Fca.crt true}
{postgresql://[email protected]:39917/defaultdb?sslcert=test_certs%2Fclient.root.crt&sslkey=test_certs%2Fclient.root.key&sslmode=verify-full&sslrootcert=test_certs%2Fca.crt true}
{postgresql://[email protected]:36529/defaultdb?sslcert=test_certs%2Fclient.root.crt&sslkey=test_certs%2Fclient.root.key&sslmode=verify-full&sslrootcert=test_certs%2Fca.crt true}
{postgresql://[email protected]:39917/defaultdb?sslcert=test_certs%2Fclient.root.crt&sslkey=test_certs%2Fclient.root.key&sslmode=verify-full&sslrootcert=test_certs%2Fca.crt true}
{postgresql://[email protected]:36529/defaultdb?sslcert=test_certs%2Fclient.root.crt&sslkey=test_certs%2Fclient.root.key&sslmode=verify-full&sslrootcert=test_certs%2Fca.crt true}
{postgresql://[email protected]:39917/defaultdb?sslcert=test_certs%2Fclient.root.crt&sslkey=test_certs%2Fclient.root.key&sslmode=verify-full&sslrootcert=test_certs%2Fca.crt true}
{postgresql://[email protected]:36529/defaultdb?sslcert=test_certs%2Fclient.root.crt&sslkey=test_certs%2Fclient.root.key&sslmode=verify-full&sslrootcert=test_certs%2Fca.crt true}
{postgresql://[email protected]:39917/defaultdb?sslcert=test_certs%2Fclient.root.crt&sslkey=test_certs%2Fclient.root.key&sslmode=verify-full&sslrootcert=test_certs%2Fca.crt true}
{postgresql://[email protected]:36529/defaultdb?sslcert=test_certs%2Fclient.root.crt&sslkey=test_certs%2Fclient.root.key&sslmode=verify-full&sslrootcert=test_certs%2Fca.crt true}
There are three more items to address:
- Implement deserialize_session. Given that this PR is huge, I might just push that to a subsequent PR.
- Perhaps move the changes in authentication.go to another PR. One thing that was overlooked is that the server still sends back initial connection messages in the token-based auth approach, so we'd have to slurp those messages.
- Add TestConnectionMigration (which is an end-to-end test) to this PR.
pkg/ccl/sqlproxyccl/forwarder.go, line 269 at r2 (raw file):
Previously, JeffSwenson (Jeff Swenson) wrote…
Can we do this without atomics? I do not like using atomics for control flow. Getting atomic programming right is infamously difficult. In this case, there is a race condition that occurs if RequestTransfer is called concurrently, and it's not clear to me how to fix it. At a minimum, CompareAndSwap needs to be used as the guard for this state transfer. But then there is a race with setting transferKey and transferCh.
Done - good catch re: concurrent calls. As discussed, I have moved this to use locks. My benchmarks show that using locks within a 30s workload run only slows the overall time by a few hundred ms, so I'd be fine with using locks everywhere instead of a mixture of both.
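Roughly, the lock-based guard makes the check and the state updates a single critical section (a sketch with illustrative field names):

func (f *forwarder) RequestTransfer() {
    f.mu.Lock()
    defer f.mu.Unlock()
    // Checking isTransferring and initializing the per-transfer state
    // under one lock closes the race that a bare CompareAndSwap guard
    // would leave between the flag and transferKey/transferCh.
    if f.mu.isTransferring {
        return // a transfer is already in progress
    }
    f.mu.isTransferring = true
    f.mu.transferCh = make(chan struct{})
    // ...kick off the transfer from here...
}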
pkg/ccl/sqlproxyccl/forwarder.go, line 440 at r2 (raw file):
Previously, JeffSwenson (Jeff Swenson) wrote…
idea: Instead of implementing the forwarding goroutines as a state machine, initiating the transfer could ask both of the goroutines to terminate. That allows us to implement transfer the same way authenticate is implemented: as a single function and goroutine. Once the transfer is canceled or completed, the goroutine managing the transfer could restart the forwarding goroutines.
IMO the main advantage to implementing it by stopping the forwarding goroutines is that we can keep the steady-state code simple and group all of the transfer logic together.
As discussed, this would increase the overall transfer latency since we'd have to wait for the server-to-client processor to stop before sending the transfer state message, so I kept the current approach. I do like the idea since it makes things cleaner, but the tradeoff wouldn't be worth it.
pkg/ccl/sqlproxyccl/forwarder.go, line 514 at r2 (raw file):
Previously, JeffSwenson (Jeff Swenson) wrote…
This logic can be simplified by context.WithTimeout:

transferCtx, cancel := context.WithTimeout(ctx, timeout)
defer cancel()
err := transfer(transferCtx)
if transferCtx.Err() != nil && ctx.Err() == nil {
    return errTransferTimeout
}
return err

Alternatively, if we do not specifically care about errTransferTimeout, it can use context.WithTimeout.
Hm, this probably isn't relevant anymore since I moved it elsewhere.
There are three more items to address:
1. Implement deserialize_session. Given that this PR is huge, I might just push that to a subsequent PR.
2. Perhaps move the changes in authentication.go to another PR. One thing that was overlooked is that the server still sends back initial connection messages in the token-based auth approach, so we'd have to slurp those messages.
3. Add TestConnectionMigration (which is an end-to-end test) to this PR.
(2) and (3) are done. I've decided to just leave them in this PR, but happy to break them down into separate PRs if reviewers prefer otherwise. (1) will definitely come in a later PR.
pkg/ccl/sqlproxyccl/forwarder.go, line 629 at r3 (raw file):
// NOTE: This should only be called during a transfer process. We assume
// that transferCloserCh has already been initialized.
func (f *forwarder) runTransferTimeoutHandler(waitTimeout time.Duration) {
The end-to-end transfer process works, but I'm not happy with this function. The unfortunate thing is that we unblock clients through context, but for servers, we'd do that by closing the forwarder. This goroutine is needed to ensure that we can terminate if we time out when sending the SHOW TRANSFER STATE query to the server. Also, perhaps I ran into a rabbit hole with all the transferConnRecoverable stuff, but I don't see a better way to address that. Happy to discuss.
closer is needed instead of relying on transferCtx.Done() to ensure that we can distinguish between a transfer timeout and an actual transfer completion. If we block on transferCtx.Done() in the client-to-server processor, then when the timeout handler gets triggered, we will unblock automatically, which is incorrect.
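Concretely, the client-to-server processor would wait like this (a sketch using the names from this comment):

// Correct: wait on the explicit closer channel, which is closed only
// when the transfer actually completes or the forwarder is torn down.
<-closer

// Incorrect: if we waited on transferCtx.Done() here instead, the
// timeout handler canceling transferCtx would unblock the processor as
// though the transfer had completed.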
Informs cockroachdb#76000. Extracted from cockroachdb#76805.

This commit adds helpers related to connection migration. This includes support for retrieving the transfer state through SHOW TRANSFER STATE, as well as deserializing the session through crdb_internal.deserialize_session. We also introduce a new error marker for errors that denote connections which are recoverable. In our case, we convert SQL errors from the first column of the SHOW TRANSFER STATE query (e.g. serialization errors) to Go errors, and mark them with the errConnRecoverableSentinel sentinel.

Release note: None
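As a rough illustration of the sentinel-based marking described here (a sketch; the exact wiring in the committed code may differ):

// errConnRecoverableSentinel marks errors for which the connection can
// still be recovered, e.g. serialization errors surfaced in the first
// column of SHOW TRANSFER STATE.
var errConnRecoverableSentinel = errors.New("connection recoverable")

// markAsConnRecoverableError attaches the sentinel to err.
func markAsConnRecoverableError(err error) error {
    return errors.Mark(err, errConnRecoverableSentinel)
}

// isConnRecoverableError reports whether err carries the sentinel.
func isConnRecoverableError(err error) bool {
    return errors.Is(err, errConnRecoverableSentinel)
}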
FYI, I extracted some of the work here in #77109. I'll do the same for the auth stuff.
Informs cockroachdb#76000. Extracted from cockroachdb#76805.

Previously, we assumed that with the token-based authentication, the server is ready to accept queries the moment we connect to it. This assumption has been proved wrong during the integration tests with the forwarder, and there's an implicit AuthenticateOK step that happens after connecting to the server. During that time, initial connection data such as ParameterStatus and BackendKeyData messages will be sent to the client as well. For now, we will ignore those messages. Once we start implementing query cancellation within the proxy, that has to be updated to cache the new BackendKeyData.

Release note: None
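The message slurping might look roughly like this (a sketch using jackc/pgproto3 types for illustration; the proxy's actual interceptor plumbing differs):

// slurpInitialMessages drains the server's post-auth connection data
// until ReadyForQuery, ignoring ParameterStatus and BackendKeyData.
func slurpInitialMessages(fe *pgproto3.Frontend) error {
    for {
        msg, err := fe.Receive()
        if err != nil {
            return err
        }
        switch msg.(type) {
        case *pgproto3.ReadyForQuery:
            // The server is now truly ready to accept queries.
            return nil
        case *pgproto3.ParameterStatus, *pgproto3.BackendKeyData:
            // Ignored for now; BackendKeyData will need to be cached
            // once the proxy supports query cancellation.
        default:
            return errors.Newf("unexpected message: %T", msg)
        }
    }
}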
OK - this is ready for another round of review. Some tests are TODO, which will be addressed before merging, but I'm putting this up for a review first. Only look at the last three commits.
This is looking great! I have a few minor comments.
f.mu.Lock()
This mutex is held for most of the transfer process. The processors are thread safe, and there is the isTransferring state, so we can avoid most of this locking if we create helpers for retrieving and setting the processors.
The code as written is correct. I like to keep critical sections relatively short in an effort to reduce the probability of deadlocks and contention.
I'm thinking something like:
func (f *forwarder) getProcessors() (request processor, response processor) {
    f.mu.Lock()
    defer f.mu.Unlock()
    return f.mu.request, f.mu.response
}

func (f *forwarder) replaceServerConn(newServerConn net.Conn) {
    f.mu.Lock()
    defer f.mu.Unlock()
    // Transfer was successful.
    clockFn := makeLogicalClockFn()
    f.mu.serverConn.Close()
    f.mu.serverConn = newServerConn
    f.mu.request = newProcessor(clockFn, f.mu.clientConn, f.mu.serverConn)
    f.mu.response = newProcessor(clockFn, f.mu.serverConn, f.mu.clientConn)
}

...

request, response := f.getProcessors()

// Suspend both processors before starting the transfer.
if err := request.suspend(ctx); err != nil {
    return errors.Wrap(err, "suspending request processor")
}
if err := response.suspend(ctx); err != nil {
    return errors.Wrap(err, "suspending response processor")
}

// Transfer the connection.
newServerConn, err := transferConnection(ctx, f.connector, f.mu.clientConn, f.mu.serverConn)
if err != nil {
    return errors.Wrap(err, "transferring connection")
}

f.replaceServerConn(newServerConn)
return nil
TFTR!
pkg/ccl/sqlproxyccl/conn_migration.go, line 125 at r22 (raw file):
Previously, JeffSwenson (Jeff Swenson) wrote…
I don't think logCtx is necessary. I looked at the implementation of log and I did not find any calls that check ctx.Err(). If any did, I would argue that's a bug. It is normal to log something that goes wrong during the cancellation process.
I didn't have this initially, but the process panics with a span-use-after-close error:
panic: use of Span after Finish. Span: proxy-con-serve. Finish previously called at: <stack not captured. Set debugUseAfterFinish>
goroutine 2224 [running]:
github.com/cockroachdb/cockroach/pkg/util/tracing.(*Span).detectUseAfterFinish(0xb7faeb8)
/Users/jaylim/workspace/src/github.com/cockroachdb/cockroach/pkg/util/tracing/span.go:185 +0x134
github.com/cockroachdb/cockroach/pkg/util/tracing.(*Span).RecordingType(0xc0011d2900)
/Users/jaylim/workspace/src/github.com/cockroachdb/cockroach/pkg/util/tracing/span.go:387 +0x1e
github.com/cockroachdb/cockroach/pkg/util/tracing.(*Span).IsVerbose(...)
/Users/jaylim/workspace/src/github.com/cockroachdb/cockroach/pkg/util/tracing/span.go:395
github.com/cockroachdb/cockroach/pkg/util/log.getSpanOrEventLog({0x8f543d0, 0xc005301580})
/Users/jaylim/workspace/src/github.com/cockroachdb/cockroach/pkg/util/log/trace.go:92 +0x3a
github.com/cockroachdb/cockroach/pkg/util/log.logfDepth({0x8f543d0, 0xc005301580}, 0x1, 0x1, 0x0, {0x7d9f468, 0x1f}, {0xc0051a7d60, 0x1, 0x1})
/Users/jaylim/workspace/src/github.com/cockroachdb/cockroach/pkg/util/log/channels.go:55 +0x2a8
github.com/cockroachdb/cockroach/pkg/util/log.Infof(...)
/Users/jaylim/workspace/src/github.com/cockroachdb/cockroach/pkg/util/log/log_channels_generated.go:263
github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl.(*forwarder).runTransfer.func2()
/Users/jaylim/workspace/src/github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl/conn_migration.go:145 +0x1f8
github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl.(*forwarder).runTransfer(0xc00517d200)
/Users/jaylim/workspace/src/github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl/conn_migration.go:174 +0x5d7
github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl.(*forwarder).RequestTransfer.func1()
/Users/jaylim/workspace/src/github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl/forwarder.go:154 +0x1d
created by github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl.(*forwarder).RequestTransfer
/Users/jaylim/workspace/src/github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl/forwarder.go:154 +0x5b
FAIL github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl 2.852s
pkg/ccl/sqlproxyccl/conn_migration.go, line 148 at r22 (raw file):
Previously, JeffSwenson (Jeff Swenson) wrote…
This mutex is held for most of the transfer process… (the full suggestion, with the getProcessors / replaceServerConn sketch, is quoted above)
Done. Updated it to your suggestion, but also introduced getConns for the client and server connections.
pkg/ccl/sqlproxyccl/forwarder.go, line 190 at r18 (raw file):
Previously, JeffSwenson (Jeff Swenson) wrote…
😄 On a 100gbps ethernet line, sending the smallest possible pg message, with no protocol overhead, it would take 217 years of continuous transmission to overflow an int64 counter.
>>> line_rate = 100 * 2 ** 30  # 100 gibs
>>> message_size = 5 * 8  # minimum message size of 5 bytes, in bits
>>> msg_per_second = line_rate / message_size
>>> seconds_to_overflow = 2 ** 64 / msg_per_second
>>> years_to_overflow = seconds_to_overflow / (60 * 60 * 24 * 365)
>>> years_to_overflow
217.908031253171
😄
pkg/ccl/sqlproxyccl/forwarder.go, line 320 at r18 (raw file):
It would require passing peekErr and typ as arguments.
Doing this muddies the logic in exitPeek, and I'd rather keep it this way with the extra mutex lock+unlock. If this proves to be an issue, we could easily optimize this at a later time. Doing it the current way also makes sure that updating the last message details is tied to the ForwardMsg call, and not when exiting.
pkg/ccl/sqlproxyccl/forwarder.go, line 83 at r20 (raw file):
Done. Added a note about that.
Generally I would advise against acquiring multiple locks at the same time because it is a deadlock risk.
I think as long as we're aware of it, such risks can be minimized. Also, using a single lock for all three structs is going to cause contention between the request and response processors, especially when queries are pipelined.
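For illustration, the multi-lock layout under discussion (field layout assumed; the point is per-processor mutexes plus a documented lock order):

// Each processor carries its own mutex so the request and response
// directions don't contend, especially with pipelined queries. If the
// forwarder's lock and a processor's lock must both be held, acquire
// the forwarder's first; a consistent order keeps this deadlock-free.
type processor struct {
    mu struct {
        syncutil.Mutex
        lastMessageType          byte
        lastMessageTransferredAt uint64
    }
}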
LGTM!
pkg/ccl/sqlproxyccl/conn_migration.go, line 125 at r22 (raw file):
Previously, jaylim-crl (Jay Lim) wrote…
I didn't have this initially, but the process panics with a span-use-after-close error… (the full stack trace is quoted above)
Jay and I dug into this offline. The root cause is that the sqlproxy is using the stopper incorrectly. This will be fixed in a separate PR: #77689
pkg/ccl/sqlproxyccl/forwarder.go, line 158 at r16 (raw file):
Previously, jaylim-crl (Jay Lim) wrote…
@jeffswenson, how strongly do you feel that we need to guard these variables with a mutex when they don't need one? I've attempted to do so, but it was extremely difficult to read/follow, so I aborted the idea.
This was resolved by the redesign.
TFTR!
pkg/ccl/sqlproxyccl/conn_migration.go, line 125 at r22 (raw file):
The root cause is that the sqlproxy is using the stopper incorrectly.
I guess it's more that we're not using the stopper everywhere. Some places use stopper.RunAsyncTask, whereas others use the go keyword directly. The latter is handy as we don't need to thread the stopper all the way down. Fixing all uses of the go keyword isn't straightforward. We'd also have to rethink how we want traces to be displayed.
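The two patterns being contrasted look like this (task name illustrative; stopper.RunAsyncTask is the real stopper API):

// Tracked: the stopper waits for this task during shutdown, so nothing
// is left logging into a context whose span has been finished.
if err := stopper.RunAsyncTask(ctx, "conn-transfer", func(ctx context.Context) {
    f.runTransfer()
}); err != nil {
    // The stopper is already quiescing.
}

// Untracked: nothing waits for this goroutine on shutdown, so a tracing
// span in ctx can be finished underneath it, which is the source of the
// panic above.
go f.runTransfer()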
bors r=JeffSwenson
Build succeeded:
Previously, we added support for connection migration in cockroachdb#76805. This commit adds a new `proxy.conn_migration.attempted.latency` metric that tracks latency for attempted connection migrations. Having this metric would be beneficial as we can now know how long users were blocked during connection migrations. Release justification: Low-risk metric-only change within sqlproxy. Release note: None
ccl/sqlproxyccl: add metric that records the transfer response message size. Informs cockroachdb#76000, and follow up to cockroachdb#76805. This commit adds a new proxy.conn_migration.transfer_response.message_size metric that will track the distribution of the transfer response message size. This will be used to tune a value for the SQL cluster setting: sql.session_transfer.max_session_size. Release justification: Low-risk metric-only change within sqlproxy. Release note: None
77442: sqlproxyccl: Add codeUnavailable to the list of error codes r=alyshanjahani-crl a=alyshanjahani-crl

Release justification: fixes for high-priority bug in existing functionality

Previously, if a tenant cluster had maxPods set to 0, the error returned by directory.EnsureTenantAddr was not treated as a non-retryable error. The tenant directory implementation used in CockroachCloud now identifies this situation and returns a status error with codes.FailedPrecondition and a descriptive message. This patch adds a check for the FailedPrecondition error in connector.lookupAddr.

Release note: None

77626: sql: re-enable txnidcache by default r=maryliag,ajwerner a=Azhng

Previously, TxnID cache was disabled by default due to performance issues. This has since been resolved in #77220. This commit re-enables TxnID Cache by default.

Release justification: low risk high reward change.
Release note: None

77675: cmd/roachtest: tweaks to sstable corruption test r=itsbilal a=nicktrav

A few minor changes to the existing roachtest:
- Reduce the warehouse count on ingest back to 100 - this should be ample, and it matches the warehouse count on the read path in a later part of the test.
- Change the stop options to wait until the process has been terminated before continuing.
- Re-work the command to search for SSTables to corrupt - use tables from the most recent manifest (if multiple are present); consider SSTables with either a start or end key representing a user table; shuffle tables to distribute corruption over the LSM; perform filtering in the bash command rather than in the test runner itself.

Release justification: Roachtest-only change.
Release note: None. Touches #77321.

77700: ccl/sqlproxyccl: add connection migration-related metrics r=JeffSwenson a=jaylim-crl

#### ccl/sqlproxyccl: add metric that measures connection migration latency

Previously, we added support for connection migration in #76805. This commit adds a new `proxy.conn_migration.attempted.latency` metric that tracks latency for attempted connection migrations. Having this metric would be beneficial as we can now know how long users were blocked during connection migrations.

Release justification: Low-risk metric-only change within sqlproxy.
Release note: None

#### ccl/sqlproxyccl: add metric that records the transfer response message size

Informs #76000, and follow up to #76805. This commit adds a new proxy.conn_migration.transfer_response.message_size metric that will track the distribution of the transfer response message size. This will be used to tune a value for the SQL cluster setting: sql.session_transfer.max_session_size.

Release justification: Low-risk metric-only change within sqlproxy.
Release note: None

Co-authored-by: Alyshan Jahani <[email protected]>
Co-authored-by: Azhng <[email protected]>
Co-authored-by: Nick Travers <[email protected]>
Co-authored-by: Jay <[email protected]>
ccl/sqlproxyccl: add last message details to the forwarder's processor
This commit adds last message details to the forwarder's processor, and in
particular, lastMessageType and lastMessageTransferredAt. The latter is
implemented using a simple logical clock, which will be used to determine
ordering of events. These two will be used during connection migration to
determine a safe transfer point.
At the same time, we plumb the connector and metrics into the processor, which
will be used in subsequent commits.
Release justification: Low risk, sqlproxy-only change.
Release note: None
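The "simple logical clock" described here can be a shared atomic counter; a minimal sketch (the actual implementation may differ):

import "sync/atomic"

// makeLogicalClockFn returns a closure that yields a strictly
// increasing value on each call. Sharing one closure between the
// request and response processors gives a total order over messages.
func makeLogicalClockFn() func() uint64 {
    var counter uint64
    return func() uint64 {
        return atomic.AddUint64(&counter, 1)
    }
}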
ccl/sqlproxyccl: support waitResumed on the processor to block until resumption
Previously, there could be a race where suspend() was called right after
resuming the processors. If the processor goroutines have not started, suspend
will implicitly return, leading to a violation of an invariant, where we want
the processors to be suspended before proceeding. This commit adds a new
waitResumed method on the processor that allows callers to block until the
processors have been resumed.
Release justification: sqlproxy-only change.
Release note: None
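One way waitResumed can work is a condition variable over a resumed flag (a sketch under those assumptions; not necessarily the committed code):

// The run loop sets resumed=true and broadcasts once it has started;
// suspend() may then safely assume there is a running goroutine to
// interrupt. cond must be initialized as sync.NewCond(&p.mu.Mutex).
type processor struct {
    mu struct {
        sync.Mutex
        cond    *sync.Cond
        resumed bool
    }
}

// waitResumed blocks until the run loop has actually started, closing
// the race where suspend() is called before resume() takes effect.
func (p *processor) waitResumed() {
    p.mu.Lock()
    defer p.mu.Unlock()
    for !p.mu.resumed {
        p.mu.cond.Wait()
    }
}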
ccl/sqlproxyccl: complete connection migration support in the forwarder
Informs #76000.
This commit completes the connection migration feature in the forwarder
within sqlproxy. The idea is as described in the RFC.
A couple of new sqlproxy metrics have been added as well:
- proxy.conn_migration.success
- proxy.conn_migration.error_fatal
- proxy.conn_migration.error_recoverable
- proxy.conn_migration.attempted
For more details, see metrics.go in the sqlproxyccl package.
Release justification: This completes the first half of the connection
migration feature. This is low risk as part of the code is guarded behind the
connection migration feature, which is currently not being used in production.
To add on, CockroachCloud is the only user of sqlproxy.
Release note: None